2 Grammar of Graphics
2.1 The Tidy Approach
2.1.1 Opinionated software
Opinionated software is a software product that believes a certain way of approaching a business process is inherently better and provides software crafted around that approach. ~ Stuart Eccles
2.1.2 Tidy data
The defining opinion of the tidyverse is its wholehearted adoption of tidy data. Tidy data has three features:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a dataframe. (This is from the paper, not the book)
Source: R for Data Science
Tidy data was formalized by Hadley Wickham in “Tidy Data” in the Journal of Statistical Software in 2014. It is equivalent to Codd’s 3rd normal form (Codd, 1990) for relational databases.
Tidy datasets are all alike, but every messy dataset is messy in its own way. ~ Hadley Wickham
The tidy approach to data science is powerful because it breaks data work into two distinct parts.
- First, get the data into a tidy format.
- Second, use tools optimized for tidy data.
By standardizing the data structure that most community-created tools expect, the framework gave scattered development a common direction and reduced the friction of data work.
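The two steps above can be sketched with a small example. The `messy` table and its values are hypothetical, invented only for illustration. `pivot_longer()` from tidyr is one common way to do the first step, though not the only one.

```r
library(tidyr)
library(tibble)

# Hypothetical messy table: one column per year, so the variable "year"
# is stored in the column names instead of in a column.
messy <- tribble(
  ~state, ~`2019`, ~`2020`,
  "VA",   10,      12,
  "MD",   7,       9,
  "DC",   3,       4
)

# Step 1: get the data into a tidy format (one row per state-year).
tidy <- pivot_longer(
  messy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "count"
)

# Step 2: tools built for tidy data can now treat year and count as variables.
tidy
```

After the pivot, each variable (state, year, count) has its own column and each state-year observation has its own row.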
2.2 Grammar of Graphics
ggplot2 is an R package for data visualization that Hadley Wickham developed during his graduate studies at Iowa State University. The theory behind ggplot2 is formalized in “A Layered Grammar of Graphics” by Hadley Wickham, which was published in the Journal of Computational and Graphical Statistics in 2010.
The grammar of graphics, originally developed by Leland Wilkinson, is a theoretical framework that breaks all data visualizations into their component pieces. With the layered grammar of graphics, Wickham extends Wilkinson’s grammar of graphics and implements it in R. The cohesion is impressive: the theory flows directly into the code, and the code informs the data visualization process in a way not reflected in any other data viz tool.
There are eight main ingredients to the grammar of graphics. We will work our way through the ingredients with many hands-on examples.
1 Data are the values represented in the visualization.
ggplot(data = ) or data %>% ggplot()
# A tibble: 19,537 × 7
   name   year category   lat  long  wind pressure
   <chr> <dbl>    <dbl> <dbl> <dbl> <int>    <int>
 1 Amy    1975       NA  27.5 -79      25     1013
 2 Amy    1975       NA  28.5 -79      25     1013
 3 Amy    1975       NA  29.5 -79      25     1013
 4 Amy    1975       NA  30.5 -79      25     1013
 5 Amy    1975       NA  31.5 -78.8    25     1012
 6 Amy    1975       NA  32.4 -78.7    25     1012
 7 Amy    1975       NA  33.3 -78      25     1011
 8 Amy    1975       NA  34   -77      30     1006
 9 Amy    1975       NA  34.4 -75.8    35     1004
10 Amy    1975       NA  34   -74.8    40     1002
# ℹ 19,527 more rows
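The two ways of supplying data are interchangeable. The sketch below assumes the tibble printed above is the `storms` data that ships with dplyr, restricted to seven columns. The name `storms_small` is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Assumption: the printed tibble is dplyr's built-in storms data
# restricted to seven columns.
storms_small <- storms %>%
  select(name, year, category, lat, long, wind, pressure)

# Form 1: pass the data frame as the data argument.
p1 <- ggplot(data = storms_small)

# Form 2: pipe the data frame into ggplot().
p2 <- storms_small %>%
  ggplot()

# Both plot objects hold the same data. Printing either one draws an empty
# canvas because no aesthetic mappings or geometric objects exist yet.
identical(p1$data, p2$data)
```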
2 Aesthetic mappings are directions for how variables from the data are mapped to visual elements in the data visualization. Aesthetic mappings show variation in the data through variation in the data visualization. Aesthetic mappings include linking variables to the x-position, y-position, color, fill, shape, transparency, and size.
aes(x = , y = , color = )
X or Y
Color or Fill
Size
Shape
Others: transparency, line type
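As a minimal sketch of an aesthetic mapping, the code below links three variables to three visual elements. It assumes the `storms` data from dplyr. No geometric object is added, so the printed plot would show only labeled axes.

```r
library(dplyr)   # provides the storms data
library(ggplot2)

# Map pressure to the x-position, wind to the y-position, and
# category to color. Without a geometric object, printing this plot
# shows labeled axes and nothing else.
p <- ggplot(
  data = storms,
  mapping = aes(x = pressure, y = wind, color = category)
)

# The mapping is stored on the plot object.
names(p$mapping)
```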
3 Geometric objects are representations of the data, including points, lines, and polygons.
Plots are often named after their geometric object(s).
geom_bar() or geom_col()
geom_line()
geom_point()
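The three geometric objects listed above can be sketched as follows, again assuming the `storms` data from dplyr. The summarized table `yearly` and the column `max_wind` are hypothetical names introduced for this example.

```r
library(dplyr)
library(ggplot2)

# Scatter plot: one point per observation.
scatter <- storms %>%
  ggplot(mapping = aes(x = pressure, y = wind)) +
  geom_point()

# Summarize to one row per year for the line and column charts.
yearly <- storms %>%
  group_by(year) %>%
  summarize(max_wind = max(wind))

# Line plot: maximum wind speed recorded in each year.
line <- yearly %>%
  ggplot(mapping = aes(x = year, y = max_wind)) +
  geom_line()

# Column chart: the same summarized values drawn as bars.
columns <- yearly %>%
  ggplot(mapping = aes(x = year, y = max_wind)) +
  geom_col()
```

Only the `geom_*()` call changes between the line plot and the column chart; the data and aesthetic mappings stay the same.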
Aesthetic mappings like x and y almost always vary with the data. Aesthetic mappings like color, fill, shape, transparency, and size can vary with the data. But those arguments can also be added as styles that don’t vary with the data. If you include those arguments in aes(), they will show up in the legend (which can be annoying! and is also a sign that something should be changed!).
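The difference between mapping an aesthetic and setting a style can be sketched with three variations on one scatter plot. This assumes the `storms` data from dplyr. The object names are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Color varies with the data: category is mapped inside aes(),
# so ggplot2 adds a legend for it.
mapped <- storms %>%
  ggplot(mapping = aes(x = pressure, y = wind, color = category)) +
  geom_point()

# Color is a fixed style: "blue" is set outside aes(), so every
# point is blue and no legend is drawn.
styled <- storms %>%
  ggplot(mapping = aes(x = pressure, y = wind)) +
  geom_point(color = "blue", alpha = 0.2)

# Common mistake: a constant inside aes() is treated as data, which
# adds a legend with the single entry "blue" and uses a default color.
mistake <- storms %>%
  ggplot(mapping = aes(x = pressure, y = wind)) +
  geom_point(mapping = aes(color = "blue"))
```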
4 Scales control the exact behaviors of aesthetic mapping. scale_*_*() functions can change:
- The range and labels on the x-axis and y-axis
- The colors used for color and fill
- The sizes of shapes
- Shapes
There are dozens of scale functions and their names follow a formula:
- They all start with scale_.
- Next comes the name of the aesthetic for the scale (e.g. x, y, fill, size).
- Finally comes the type of variable or transformation (e.g. discrete, continuous, or reverse).
scale_x_continuous() and scale_y_continuous() are two popular scale_*_*() functions.
Before
scale_x_continuous()
After
scale_x_reverse()
Before
scale_size_continuous(breaks = c(25, 75, 125))
After
scale_size_continuous(range = c(0.5, 20), breaks = c(25, 75, 125))
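The scale functions above can be tried on a small plot. This sketch assumes the `storms` data from dplyr and, to keep drawing quick, keeps only observations from 2005. The subset and the object names are assumptions, not part of the original examples.

```r
library(dplyr)
library(ggplot2)

# Assumption: a one-year subset keeps the example quick to draw.
storms_2005 <- storms %>%
  filter(year == 2005)

base <- storms_2005 %>%
  ggplot(mapping = aes(x = pressure, y = wind, size = wind)) +
  geom_point(alpha = 0.2)

# scale_x_reverse() flips the x-axis so lower pressure sits to the right.
reversed <- base +
  scale_x_reverse()

# range sets the smallest and largest point sizes;
# breaks sets which wind values appear in the size legend.
resized <- base +
  scale_size_continuous(range = c(0.5, 20), breaks = c(25, 75, 125))
```

Each function name follows the pattern `scale_`, then the aesthetic, then the variable type or transformation.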
5 Coordinate systems map scaled geometric objects to the position of objects on the plane of a plot. The two most popular coordinate systems are the Cartesian coordinate system and the polar coordinate system.
coord_polar()
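As a sketch of how the coordinate system works independently of the other ingredients, the code below draws the same columns in Cartesian and then polar coordinates. It assumes the `storms` data from dplyr. The table `monthly` is a hypothetical summary created for this example.

```r
library(dplyr)
library(ggplot2)

# Count observations in each month.
monthly <- storms %>%
  count(month)

# Default Cartesian coordinate system.
cartesian <- monthly %>%
  ggplot(mapping = aes(x = factor(month), y = n)) +
  geom_col()

# Swapping only the coordinate system bends the same columns around a circle.
polar <- cartesian +
  coord_polar()
```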
6 Facets (optional) break data into meaningful subsets. Common faceting functions include facet_wrap(), facet_grid(), and facet_geo() (the last from the geofacet package).
facet_wrap()
facet_wrap(~ category)
facet_grid()
facet_grid(month ~ year)
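Both faceting calls can be sketched with the `storms` data from dplyr. The filter to two years and three months is an assumption made only so the grid stays readable.

```r
library(dplyr)
library(ggplot2)

# facet_wrap(): one panel per value of category, wrapped into rows.
wrapped <- storms %>%
  ggplot(mapping = aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ category)

# facet_grid(): rows for month and columns for year.
# Assumption: two years and three months keep the grid readable.
gridded <- storms %>%
  filter(year %in% c(2004, 2005), month %in% 8:10) %>%
  ggplot(mapping = aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.2) +
  facet_grid(month ~ year)
```

In `facet_grid()`, the variable before `~` defines the rows and the variable after it defines the columns.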
7 Statistical transformations (optional) transform the data, typically through summary statistics and functions, before aesthetic mapping.
Before a transformation, each observation in the data is represented by one geometric object (e.g. a point in a scatter plot). After a transformation, a geometric object can represent more than one observation (e.g. a bar in a histogram).
Note: geom_bar() performs a statistical transformation. Use geom_col() to create a column chart with bars that encode individual observations in the data set.
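The contrast between geom_bar() and geom_col() can be sketched by counting observations per year in two ways. This assumes the `storms` data from dplyr. The object names are hypothetical.

```r
library(dplyr)
library(ggplot2)

# geom_bar() counts the rows in each year before drawing, so each bar
# represents many observations.
counted_by_ggplot <- storms %>%
  ggplot(mapping = aes(x = year)) +
  geom_bar()

# geom_col() draws the y values it is given, so the counting happens
# in the data step first.
counted_by_hand <- storms %>%
  count(year) %>%
  ggplot(mapping = aes(x = year, y = n)) +
  geom_col()
```

The two charts should look alike; the difference is whether ggplot2 or the data preparation step does the counting.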
2.2.1 Themes
8 Theme controls the visual style of a plot, including font types, font sizes, background colors, margins, and positioning.
Default theme
Theme Minimal
fivethirtyeight theme
urbnthemes
If you prefer the minimal theme, you can add theme_minimal() to each visualization or add theme_set(theme_minimal()) at the beginning of your script.
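Both options can be sketched as follows, assuming the `storms` data from dplyr. Note that the theme function is called with parentheses inside theme_set().

```r
library(dplyr)
library(ggplot2)

p <- storms %>%
  ggplot(mapping = aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.2)

# Option 1: add the theme to a single plot.
p_minimal <- p +
  theme_minimal()

# Option 2: set the default theme for every later plot in the session.
# theme_set() returns the previous theme, which can be used to restore it.
old_theme <- theme_set(theme_minimal())
```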
Layers allow for distinct geometric objects and/or distinct data sets to be combined in the same data visualization.
Inheritances pass aesthetic mappings from ggplot() to later geom_*() functions.
Notice how the aesthetic mappings are passed to ggplot() in example 9. This is useful when using layers!
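Layers and inheritance can be sketched together. This is an illustrative example, not one of the numbered examples from the text. It assumes the `storms` data from dplyr, and the summary table `yearly` is hypothetical.

```r
library(dplyr)
library(ggplot2)

# A second data set: the maximum wind speed in each year.
yearly <- storms %>%
  group_by(year) %>%
  summarize(wind = max(wind))

layered <- storms %>%
  ggplot(mapping = aes(x = year, y = wind)) +
  # Layer 1 inherits both the data and the mappings from ggplot().
  geom_point(alpha = 0.1) +
  # Layer 2 inherits the mappings but supplies its own data.
  geom_line(data = yearly, color = "red")
```

Because x and y are declared once in ggplot(), neither layer needs to repeat them.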
3 Review
3.0.1 Theory
- Data
- Aesthetic mappings
- Geometric objects
- Scales
- Coordinate systems
- Facets
- Statistical transformations
- Theme
3.0.2 Functions
- ggplot()
- aes()
- geom_*()
  - geom_point()
  - geom_line()
  - geom_col()
- scale_*_*()
  - scale_y_continuous()
- coord_*()
- facet_*()
- labs()
- ggsave()